Iteration 0¶

Phishing Link Detection Machine Learning¶

Jonathan Christyadi (502705) - AI Core 02

This notebook aims at predicting the likelihood of a link being a phishing link or a legitimate link with a focus on exploring and testing hypotheses that necessitate further research.

Dataset: https://data.mendeley.com/datasets/c2gw7fy2j4/3

In [ ]:
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__)     # 1.1.3
print("pandas version:", pd.__version__)            # 1.5.1
print("seaborn version:", seaborn.__version__)          # 0.12.1
scikit-learn version: 1.4.1.post1
pandas version: 2.2.1
seaborn version: 0.13.2

📦 Data provisioning¶

After loading the dataset, I found out some inconsistencies among the data. First the label of the link (phishing or legitimate) can be changed into binary format. Also, for domain_with_copyright column, some are in binary and some are written in alphabets, for example: zero, One, etc.

In [ ]:
df = pd.read_csv("Data\dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 phishing
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 phishing
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 phishing
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 legitimate
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 legitimate

5 rows × 87 columns

In [ ]:
# Taking a look at the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19431 entries, 0 to 19430
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   id                          19431 non-null  object
 1   url                         19431 non-null  object
 2   url_length                  19431 non-null  object
 3   hostname_length             19431 non-null  object
 4   ip                          19431 non-null  object
 5   total_of.                   19431 non-null  object
 6   total_of-                   19431 non-null  object
 7   total_of@                   19431 non-null  object
 8   total_of?                   19431 non-null  object
 9   total_of&                   19431 non-null  object
 10  total_of=                   19431 non-null  object
 11  total_of_                   19431 non-null  object
 12  total_of~                   19431 non-null  object
 13  total_of%                   19431 non-null  object
 14  total_of/                   19431 non-null  object
 15  total_of*                   19431 non-null  object
 16  total_of:                   19431 non-null  object
 17  total_of,                   19431 non-null  object
 18  total_of;                   19431 non-null  object
 19  total_of$                   19431 non-null  object
 20  total_of_www                19431 non-null  object
 21  total_of_com                19431 non-null  object
 22  total_of_http_in_path       19431 non-null  object
 23  https_token                 19431 non-null  object
 24  ratio_digits_url            19431 non-null  object
 25  ratio_digits_host           19431 non-null  object
 26  punycode                    19431 non-null  object
 27  port                        19431 non-null  object
 28  tld_in_path                 19431 non-null  object
 29  tld_in_subdomain            19431 non-null  object
 30  abnormal_subdomain          19431 non-null  object
 31  nb_subdomains               19431 non-null  object
 32  prefix_suffix               19431 non-null  object
 33  random_domain               19431 non-null  object
 34  shortening_service          19431 non-null  object
 35  path_extension              19431 non-null  object
 36  nb_redirection              19431 non-null  object
 37  nb_external_redirection     19431 non-null  object
 38  length_words_raw            19431 non-null  object
 39  char_repeat                 19431 non-null  object
 40  shortest_words_raw          19431 non-null  object
 41  shortest_word_host          19431 non-null  object
 42  shortest_word_path          19431 non-null  object
 43  longest_words_raw           19431 non-null  object
 44  longest_word_host           19431 non-null  object
 45  longest_word_path           19431 non-null  object
 46  avg_words_raw               19431 non-null  object
 47  avg_word_host               19431 non-null  object
 48  avg_word_path               19431 non-null  object
 49  phish_hints                 19431 non-null  object
 50  domain_in_brand             19431 non-null  object
 51  brand_in_subdomain          19431 non-null  object
 52  brand_in_path               19431 non-null  object
 53  suspecious_tld              19431 non-null  object
 54  statistical_report          19431 non-null  object
 55  nb_hyperlinks               19431 non-null  object
 56  ratio_intHyperlinks         19431 non-null  object
 57  ratio_extHyperlinks         19431 non-null  object
 58  ratio_nullHyperlinks        19431 non-null  object
 59  nb_extCSS                   19431 non-null  object
 60  ratio_intRedirection        19431 non-null  object
 61  ratio_extRedirection        19431 non-null  object
 62  ratio_intErrors             19431 non-null  object
 63  ratio_extErrors             19431 non-null  object
 64  login_form                  19431 non-null  object
 65  external_favicon            19431 non-null  object
 66  links_in_tags               19431 non-null  object
 67  submit_email                19431 non-null  object
 68  ratio_intMedia              19431 non-null  object
 69  ratio_extMedia              19431 non-null  object
 70  sfh                         19431 non-null  object
 71  iframe                      19431 non-null  object
 72  popup_window                19431 non-null  object
 73  safe_anchor                 19431 non-null  object
 74  onmouseover                 19431 non-null  object
 75  right_clic                  19431 non-null  object
 76  empty_title                 19431 non-null  object
 77  domain_in_title             19431 non-null  object
 78  domain_with_copyright       19431 non-null  object
 79  whois_registered_domain     19431 non-null  object
 80  domain_registration_length  19431 non-null  object
 81  domain_age                  19431 non-null  object
 82  web_traffic                 19431 non-null  object
 83  dns_record                  19431 non-null  object
 84  google_index                19431 non-null  object
 85  page_rank                   19431 non-null  object
 86  status                      19431 non-null  object
dtypes: object(87)
memory usage: 12.9+ MB
In [ ]:
# Sampling the dataset
df.sample(10)
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
11777 3776 http://www.infoconcorso.it/ 27 19 1 2 0 0 0 0 ... 1 0 0 31 2525 0 0 0 2 legitimate
16148 8147 http://www.emi-shielding.net/ 29 21 1 2 1 0 0 0 ... 1 1 0 232 -1 0 0 0 3 legitimate
10214 2213 http://www.sooriyanfm.lk/ 25 17 1 2 0 0 0 0 ... 1 1 1 0 -1 215181 0 0 4 legitimate
8026 25 https://polarklimatsgserver.blogspot.com/ 41 32 1 2 0 0 0 0 ... 1 0 0 373 7296 0 0 1 5 phishing
3541 3541 http://www.attengo.co.il/ 25 17 0 3 0 0 0 0 ... 1 one 0 308 -1 2676195 0 0 2 legitimate
12146 4145 https://www.techwalla.com/articles/coaxial-cab... 70 17 1 2 5 0 0 0 ... 0 1 0 267 1925 2709 0 0 5 legitimate
3858 3858 https://www.bristol.gov.uk/documents/20182/164... 133 18 1 4 7 0 0 0 ... 1 One 0 0 -1 148470 0 0 5 legitimate
15147 7146 http://www.shoptopsllc.com/www.paypal.com.conf... 68 19 1 6 0 0 0 0 ... 1 0 0 468 1723 0 0 1 2 phishing
5667 5667 http://paypal-limited.pdcotton.com/signb/vhv0b2i 48 27 0 2 1 0 0 0 ... 1 one 0 369 4378 0 0 1 0 phishing
9663 1662 https://verif-main2d.blogspot.com 33 25 1 2 1 0 0 0 ... 1 0 0 372 7297 0 0 1 5 phishing

10 rows × 87 columns

Preprocessing¶

🆔 Encoding¶

After understanding the data on the sample, I found that some data are not in a good form and there is a room for improvement, such as the domain_with_copyright and status columns.

In [ ]:
df['status'].unique()
Out[ ]:
array(['phishing', 'legitimate'], dtype=object)

As you can see on the status column, there are only 2 values, phishing and legitimate. Which mean I can transform it into binary values (0 and 1).

In [ ]:
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 0 http://www.progarchives.com/album.asp?id=61737 46 20 0 3 0 0 1 0 ... 1 one 0 627 6678 78526 0 0 5 1
1 1 http://signin.eday.co.uk.ws.edayisapi.dllsign.... 128 120 0 10 0 0 0 0 ... 1 zero 0 300 65 0 0 1 0 1
2 2 http://www.avevaconstruction.com/blesstool/ima... 52 25 0 3 0 0 0 0 ... 1 zero 0 119 1707 0 0 1 0 1
3 3 http://www.jp519.com/ 21 13 0 2 0 0 0 0 ... 1 one 0 130 1331 0 0 0 0 0
4 4 https://www.velocidrone.com/ 28 19 0 2 0 0 0 0 ... 0 zero 0 164 1662 312044 0 0 4 0

5 rows × 87 columns

After a closer look, I spotted that there are some inconsistencies with the value on domain_with_copyright column, for example One and one. Similarly, I want to transform it into binary value 0 and 1, instead of the string

In [ ]:
df['domain_with_copyright'].unique()
Out[ ]:
array(['one', 'zero', 'One', 'Zero', '1', '0'], dtype=object)
In [ ]:
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
Out[ ]:
array([1, 0])

Checking null or NaN values¶

In [ ]:
# Calculate the total number of missing values in the DataFrame
total_na = df.isna().sum()
In [ ]:
# Calculate the total number of missing values in the DataFrame
total_null = df.isnull().sum()
total_null.sum()
Out[ ]:
0

Making a function to check which feature contain binary values.

In [ ]:
# Finding columns with binary values

def count_binary_columns(df):
    results = []
    counter = 0
    for col in df.columns:
        counter += 1
        if df[col].isin([0, 1]).all():
            results.append(col)
    return results, counter


count_binary_columns(df)
Out[ ]:
(['domain_with_copyright', 'status'], 87)
In [ ]:
df = df.drop(columns=['id', 'url'])
df.head()
Out[ ]:
url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& total_of= total_of_ ... domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
0 46 20 0 3 0 0 1 0 1 0 ... 1 1 0 627 6678 78526 0 0 5 1
1 128 120 0 10 0 0 0 0 0 0 ... 1 0 0 300 65 0 0 1 0 1
2 52 25 0 3 0 0 0 0 0 0 ... 1 0 0 119 1707 0 0 1 0 1
3 21 13 0 2 0 0 0 0 0 0 ... 1 1 0 130 1331 0 0 0 0 0
4 28 19 0 2 0 0 0 0 0 0 ... 0 0 0 164 1662 312044 0 0 4 0

5 rows × 85 columns

In [ ]:
df['whois_registered_domain'].unique()
Out[ ]:
array(['0', '1'], dtype=object)
In [ ]:
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')    
status
0    9716
1    9715
Name: count, dtype: int64
Out[ ]:
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>

💡 Feature selection¶

A heatmap will be used to select a suitable set of features to predict the status target upon. At this stage, I have zero idea which feature to use and I utilized heatmap to find features with the most corellation with the target feature.

Data Visualization¶

First, to determine which feature to be used on the model, I want to visualize the correlation of the features.

Creating a heatmap to visualize the correlation between the features¶

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)

Sorting the feature correlation values¶

To select the most suitable features for predicting the target variable (status), a heatmap was created to visualize the correlation between the features. By analyzing the heatmap, we can identify the features that have the highest positive or negative correlation with the target variable.

Features Plot Bar¶

Now I want to make a bar plot of the correlation with the target variable, which helps me to identify the important featueres, understand the relationship and simplify it.

In [ ]:
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
Out[ ]:
<Axes: title={'center': 'Correlation with the target variable'}>

Finding the most correlated features with the target variable based on numerical values¶

It can be seen on the plot bar that there are alot of features, I want to narrow it down by finding features with the most correlation in terms of numerical value.

In [ ]:
# Finding the most correlated features with the target variable based on numerical featrures excluding NaN values
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
Out[ ]:
url_length hostname_length total_of. total_of- total_of? total_of/ total_of_www ratio_digits_url phish_hints nb_hyperlinks domain_in_title domain_with_copyright google_index page_rank status
status 0.244348 0.240681 0.205302 -0.102849 0.293920 0.240892 -0.444561 0.356587 0.337287 -0.341295 0.339519 -0.175469 0.730684 -0.509761 1.000000
google_index 0.233061 0.216919 0.208764 -0.018285 0.202097 0.289212 -0.357215 0.323157 0.279906 -0.269482 0.265933 -0.144499 1.000000 -0.386721 0.730684
ratio_digits_url 0.434626 0.171761 0.224194 0.110341 0.325739 0.206925 -0.211165 1.000000 0.096967 -0.128915 0.152393 -0.027357 0.323157 -0.181489 0.356587
domain_in_title 0.124224 0.218850 0.108442 0.009843 0.092191 0.088462 -0.178402 0.152393 0.125857 -0.217548 1.000000 0.076105 0.265933 -0.332742 0.339519
phish_hints 0.332000 -0.019901 0.168765 0.065562 0.208052 0.501321 -0.090812 0.096967 1.000000 -0.112423 0.125857 -0.066130 0.279906 -0.203464 0.337287
total_of? 0.523172 0.164129 0.353133 0.035958 1.000000 0.243749 -0.115337 0.325739 0.208052 -0.112604 0.092191 -0.046123 0.202097 -0.123151 0.293920
url_length 1.000000 0.217586 0.447198 0.406951 0.523172 0.486490 -0.067973 0.434626 0.332000 -0.098101 0.124224 -0.004281 0.233061 -0.099900 0.244348
total_of/ 0.486490 -0.061203 0.242216 0.204793 0.243749 1.000000 -0.005628 0.206925 0.501321 -0.073183 0.088462 -0.023213 0.289212 -0.113861 0.240892
hostname_length 0.217586 1.000000 0.406834 0.059480 0.164129 -0.061203 -0.130991 0.171761 -0.019901 -0.104614 0.218850 0.073107 0.216919 -0.160621 0.240681
total_of. 0.447198 0.406834 1.000000 0.049303 0.353133 0.242216 0.068290 0.224194 0.168765 -0.093994 0.108442 0.057320 0.208764 -0.098752 0.205302
total_of- 0.406951 0.059480 0.049303 1.000000 0.035958 0.204793 0.045756 0.110341 0.065562 -0.004513 0.009843 0.020914 -0.018285 0.104676 -0.102849
domain_with_copyright -0.004281 0.073107 0.057320 0.020914 -0.046123 -0.023213 0.087826 -0.027357 -0.066130 0.192159 0.076105 1.000000 -0.144499 0.057127 -0.175469
nb_hyperlinks -0.098101 -0.104614 -0.093994 -0.004513 -0.112604 -0.073183 0.114259 -0.128915 -0.112423 1.000000 -0.217548 0.192159 -0.269482 0.221066 -0.341295
total_of_www -0.067973 -0.130991 0.068290 0.045756 -0.115337 -0.005628 1.000000 -0.211165 -0.090812 0.114259 -0.178402 0.087826 -0.357215 0.110745 -0.444561
page_rank -0.099900 -0.160621 -0.098752 0.104676 -0.123151 -0.113861 0.110745 -0.181489 -0.203464 0.221066 -0.332742 0.057127 -0.386721 1.000000 -0.509761

Displaying the top correlated features along with their correlation values.¶

On the left side is the feature name and on the right side is the correlation values which indicates the strength and direction of the correlation between each features and target variable

In [ ]:
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 features
sorted_corr['status'].head(num_features)
Out[ ]:
status                   1.000000
google_index             0.730684
ratio_digits_url         0.356587
domain_in_title          0.339519
phish_hints              0.337287
total_of?                0.293920
url_length               0.244348
total_of/                0.240892
hostname_length          0.240681
total_of.                0.205302
total_of-               -0.102849
domain_with_copyright   -0.175469
nb_hyperlinks           -0.341295
total_of_www            -0.444561
page_rank               -0.509761
Name: status, dtype: float64

Selecting the features¶

Now I can utilize the features (except the target variable) with the most correlation into the model.

In [ ]:
# List the features from the previous step into a list
selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')

# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)

# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)

# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index               int64
ratio_digits_url         float64
domain_in_title            int64
phish_hints                int64
total_of?                  int64
url_length                 int64
total_of/                  int64
hostname_length            int64
total_of.                  int64
total_of-                  int64
domain_with_copyright      int32
nb_hyperlinks              int64
total_of_www               int64
page_rank                  int64
dtype: object
int64
Out[ ]:
google_index ratio_digits_url domain_in_title phish_hints total_of? url_length total_of/ hostname_length total_of. total_of- domain_with_copyright nb_hyperlinks total_of_www page_rank status
0 0 0.108696 1 0 1 46 3 20 3 0 1 143 1 5 1
1 1 0.054688 1 2 0 128 3 120 10 0 0 0 0 0 1
2 1 0.000000 1 0 0 52 4 25 3 0 0 3 1 0 1
3 0 0.142857 1 0 0 21 3 13 2 0 1 404 1 0 0
4 0 0.000000 0 0 0 28 3 19 2 0 0 57 1 4 0
In [ ]:
# Count the number of binary columns in the selected features
features_binary = count_binary_columns(df[selected_features])
features_binary
Out[ ]:
(['google_index', 'domain_in_title', 'domain_with_copyright'], 14)

Data Scaling¶

Now I scale the data appropriately.

In [ ]:
# from sklearn.preprocessing import StandardScaler
# # Scale the data
# selected_df = selected_df.dropna()
# scaler = StandardScaler()
# selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])

Pairplot¶

Visualize the correlations, distributions, and patterns between multiple variables in the dataset.

In [ ]:
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')

# Add legends
plt.legend(title='Status', labels=['Phishing', 'Legitimate'])

# Show the plot
plt.show()

Defining target variable and feature variables¶

In this section I want to split the target and feature variables into X and y.

In [ ]:
target = 'status'

X = df[selected_features]
y = df[target]

🪓 Splitting into train/test¶

Splitting the train and test set 80% and 20% respectively. So around 15.5k are in train set and 4k in test set.

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.

🧬 Modelling¶

In this section, I want to try a few different models and how they perform compared to other models. Also, at the end I will stack some of the models.

Support Vector Machine¶

This code trains a Support Vector Machine (SVM) classifier, a powerful algorithm used for classification tasks. The SVM learns to classify data points into different categories based on their features.

In [ ]:
# SUPPORT VECTOR MACHINE SVM
from sklearn.svm import SVC
SVM = SVC()
SVM.fit(X_train, y_train)
SVM_score = SVM.score(X_test, y_test)
print("Accuracy:", SVM_score)
Accuracy: 0.8322613841008489

This code generates a classification report for the predictions made by a Support Vector Machine (SVM) model.

In [ ]:
from sklearn.metrics import classification_report
predictions = SVM.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
              precision    recall  f1-score   support

           0       0.85      0.81      0.83      1935
           1       0.82      0.85      0.84      1952

    accuracy                           0.83      3887
   macro avg       0.83      0.83      0.83      3887
weighted avg       0.83      0.83      0.83      3887

Linear Regression¶

This code trains a Linear Regression model, which is a simple method used for predicting numeric values based on input features.

In [ ]:
# LINEAR REGRESSION

from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(X_train, y_train)
linear_score = linear.score(X_test, y_test)
print("R²:", linear_score)
R²: 0.6912513253784323
In [ ]:
predictions_linear = linear.predict(X_test)
report_linear = classification_report(y_test, predictions_linear.round())
print(report_linear)
              precision    recall  f1-score   support

        -1.0       0.00      0.00      0.00         0
         0.0       0.91      0.91      0.91      1935
         1.0       0.91      0.91      0.91      1952
         2.0       0.00      0.00      0.00         0

    accuracy                           0.91      3887
   macro avg       0.46      0.46      0.46      3887
weighted avg       0.91      0.91      0.91      3887

C:\Users\jochr\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
C:\Users\jochr\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
C:\Users\jochr\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\metrics\_classification.py:1509: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

🏘️ K-NEAREST NEIGBOURS¶

This code implements the K-Nearest Neighbors (KNN) classification algorithm. KNN works by finding the 'k' nearest data points in the training set to a given input, and the majority class among those neighbors is assigned to the input.

In [ ]:
# K-NEAREST NEIGHBORS

from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(X_train, y_train)
KNN_score = KNN.score(X_test, y_test)
print("Accuracy:", KNN_score)
Accuracy: 0.9084126575765372
In [ ]:
predictions_KNN = KNN.predict(X_test)
report_KNN = classification_report(y_test, predictions_KNN)
print(report_KNN)
              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1935
           1       0.95      0.86      0.90      1952

    accuracy                           0.91      3887
   macro avg       0.91      0.91      0.91      3887
weighted avg       0.91      0.91      0.91      3887

🌲Decision Tree¶

This code trains a decision tree classifier, a type of machine learning model used for classification tasks. Then, it evaluates the accuracy of the model on test data and prints the accuracy score.

In [ ]:
# DECISION TREE

from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
decision_tree.fit(X_train, y_train)
DT_score = decision_tree.score(X_test, y_test)
print("Accuracy:", DT_score)
Accuracy: 0.9274504759454593
In [ ]:
predictions_DT = decision_tree.predict(X_test)
report_DT = classification_report(y_test, predictions_DT)
print(report_DT)
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      1935
           1       0.93      0.93      0.93      1952

    accuracy                           0.93      3887
   macro avg       0.93      0.93      0.93      3887
weighted avg       0.93      0.93      0.93      3887

This code visualizes a decision tree model using a graphical representation. It sets up target names for the classes ("phishing" and "legitimate")

In [ ]:
target_names = ["phishing", "legitimate"]
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(decision_tree, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()

🌳 Decision Tree with ADA Boosting¶

Boost the decision tree with ADA boosting and turns out, the performance (accuracy) increases by 2%.

In [ ]:
# AdaBoost with decision trees
from sklearn.ensemble import AdaBoostRegressor
adaboost_decision_tree = AdaBoostRegressor(estimator=decision_tree, n_estimators=50, random_state=21)
X_train = X_train.astype(float) 
y_train = y_train.astype(float)
adaboost_decision_tree.fit(X_train, y_train)
ada_dt_score = adaboost_decision_tree.score(X_test, y_test)
print("Accuracy:", ada_dt_score)
Accuracy: 0.9413418159867836
In [ ]:
predictions_ada_dt = adaboost_decision_tree.predict(X_test)
report_ada_dt = classification_report(y_test, predictions_ada_dt.round())
print(report_ada_dt)
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1935
           1       0.99      0.98      0.99      1952

    accuracy                           0.99      3887
   macro avg       0.99      0.99      0.99      3887
weighted avg       0.99      0.99      0.99      3887

🌳🌳🌳 Random Forest Regressor¶

This code uses a machine learning method called Random Forest, which creates a powerful model by combining many decision trees. It trains this model with training data and evaluates its accuracy with test data, displaying the accuracy score.

In [ ]:
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(n_estimators = 500, max_depth=25, n_jobs=-1)
random_forest.fit(X_train, y_train)
rf_score = random_forest.score(X_test, y_test)
print("Accuracy:", rf_score)
Accuracy: 0.9337909630387703
In [ ]:
predictions_rf = random_forest.predict(X_test)
report_rf = classification_report(y_test, predictions_rf.round())
print(report_rf)
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1935
           1       0.98      0.98      0.98      1952

    accuracy                           0.98      3887
   macro avg       0.98      0.98      0.98      3887
weighted avg       0.98      0.98      0.98      3887

🌳🌳🌳 Random Forest with ADA Boosting¶

This code sets up a method called AdaBoost with Random Forest, a technique that boosts the performance of a random forest model. It trains this boosted model with training data and evaluates its accuracy with test data, displaying the accuracy score.

In [ ]:
# AdaBoost with Random Forest
from sklearn.ensemble import AdaBoostRegressor

adaboost_random_forest = AdaBoostRegressor(estimator=random_forest, n_estimators=50, random_state=21)
adaboost_random_forest.fit(X_train, y_train)
ada_rf_score = adaboost_random_forest.score(X_test, y_test)
print("Accuracy:", ada_rf_score)
Accuracy: 0.9518304104761641
In [ ]:
predictions_ada_rf = adaboost_random_forest.predict(X_test)
report_ada_rf = classification_report(y_test, predictions_ada_rf.round())
print(report_ada_rf)
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1935
           1       0.99      0.98      0.98      1952

    accuracy                           0.98      3887
   macro avg       0.98      0.98      0.98      3887
weighted avg       0.98      0.98      0.98      3887

Apply Stacking¶

This code combines different prediction methods (like linear regression and random forest) into one super method called Stacking Regressor. It learns from data and gives a score showing how accurate its predictions are.

In [ ]:
from sklearn.ensemble import StackingRegressor

# A list of tuples with the name of the model and the model itself
estimators_list = [
    ('linear_regression', linear),
    ('random_forest', random_forest),
    ('adaboost', adaboost_decision_tree),
    ('adaboost_random_forest', adaboost_random_forest)
]

stacking_regressor = StackingRegressor(estimators=estimators_list, final_estimator=RandomForestRegressor(n_estimators=50, max_depth=25, n_jobs=-1))
stacking_regressor.fit(X_train, y_train)
stack_regressor_score = stacking_regressor.score(X_test, y_test)
print("Accuracy:", stack_regressor_score)
Accuracy: 0.9483017638835939
In [ ]:
predictions_stack_regressor = stacking_regressor.predict(X_test)
report_stack_regressor = classification_report(y_test, predictions_stack_regressor.round())
print(report_stack_regressor)
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1935
           1       0.99      0.98      0.98      1952

    accuracy                           0.98      3887
   macro avg       0.98      0.98      0.98      3887
weighted avg       0.98      0.98      0.98      3887

Comparing the performance of different models.¶

It prints a comparison report displaying each model's performance score. Finally, it identifies the best performing model by finding the model with the highest score and prints its name along with its score.

In [ ]:
# List of models and their scores
model_scores = {
    "Linear Regression": linear_score,
    "Decision Tree": DT_score,
    "Random Forest": rf_score,
    "K-Nearest Neighbors": KNN_score,
    "Support Vector Machine (SVM)": SVM_score,
    "Decison Tree with AdaBoost": ada_dt_score,
    "Random Forest with AdaBoost": ada_rf_score,
    "Stacking Regressor": stack_regressor_score
}

# Print comparison report
print("Model Comparison Report:")
print("=========================")
for model, score in model_scores.items():
    print(f"{model}: {score:.4f}")

# Find the best performing model
best_model = max(model_scores, key=model_scores.get)
print(f"\nThe best performing model is: {best_model} with a score of {model_scores[best_model]:.4f}")
Model Comparison Report:
=========================
Linear Regression: 0.6913
Decision Tree: 0.9275
Random Forest: 0.9338
K-Nearest Neighbors: 0.9084
Support Vector Machine (SVM): 0.8323
Decison Tree with AdaBoost: 0.9413
Random Forest with AdaBoost: 0.9518
Stacking Regressor: 0.9483

The best performing model is: Random Forest with AdaBoost with a score of 0.9518
In [ ]:
# Define the classification reports for each model
reports = {
    'AdaBoost Decision Tree': report_ada_dt,
    'AdaBoost Random Forest': report_ada_rf,
    'Decision Tree': report_DT,
    'Random Forest': report_rf,
    'K-Nearest Neighbors': report_KNN,
    'Support Vector Machine (SVM)': report,
    'Linear Regression': report_linear,
}

# Print the comparison report
for model, report in reports.items():
    print(f"Classification Report for {model}:")
    print(report)
Classification Report for AdaBoost Decision Tree:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1935
           1       0.99      0.98      0.99      1952

    accuracy                           0.99      3887
   macro avg       0.99      0.99      0.99      3887
weighted avg       0.99      0.99      0.99      3887

Classification Report for AdaBoost Random Forest:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1935
           1       0.99      0.98      0.98      1952

    accuracy                           0.98      3887
   macro avg       0.98      0.98      0.98      3887
weighted avg       0.98      0.98      0.98      3887

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      1935
           1       0.93      0.93      0.93      1952

    accuracy                           0.93      3887
   macro avg       0.93      0.93      0.93      3887
weighted avg       0.93      0.93      0.93      3887

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1935
           1       0.98      0.98      0.98      1952

    accuracy                           0.98      3887
   macro avg       0.98      0.98      0.98      3887
weighted avg       0.98      0.98      0.98      3887

Classification Report for K-Nearest Neighbors:
              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1935
           1       0.95      0.86      0.90      1952

    accuracy                           0.91      3887
   macro avg       0.91      0.91      0.91      3887
weighted avg       0.91      0.91      0.91      3887

Classification Report for Support Vector Machine (SVM):
              precision    recall  f1-score   support

        -1.0       0.00      0.00      0.00         0
         0.0       0.91      0.91      0.91      1935
         1.0       0.91      0.91      0.91      1952
         2.0       0.00      0.00      0.00         0

    accuracy                           0.91      3887
   macro avg       0.46      0.46      0.46      3887
weighted avg       0.91      0.91      0.91      3887

Classification Report for Linear Regression:
              precision    recall  f1-score   support

        -1.0       0.00      0.00      0.00         0
         0.0       0.91      0.91      0.91      1935
         1.0       0.91      0.91      0.91      1952
         2.0       0.00      0.00      0.00         0

    accuracy                           0.91      3887
   macro avg       0.46      0.46      0.46      3887
weighted avg       0.91      0.91      0.91      3887

Iteration 1¶

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

📦 Data provisioning¶

After loading the dataset, I found out some inconsistencies among the data. First the label of the link (phishing or legitimate) can be changed into binary format. Also, for domain_with_copyright column, some are in binary and some are written in alphabets, for example: zero, One, etc.

In [ ]:
df = pd.read_csv("Data\dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.sample(5)
Out[ ]:
id url url_length hostname_length ip total_of. total_of- total_of@ total_of? total_of& total_of= total_of_ total_of~ total_of% total_of/ total_of* total_of: total_of, total_of; total_of$ total_of_www total_of_com total_of_http_in_path https_token ratio_digits_url ratio_digits_host punycode port tld_in_path tld_in_subdomain abnormal_subdomain nb_subdomains prefix_suffix random_domain shortening_service path_extension nb_redirection nb_external_redirection length_words_raw char_repeat shortest_words_raw shortest_word_host shortest_word_path longest_words_raw longest_word_host longest_word_path avg_words_raw avg_word_host avg_word_path phish_hints domain_in_brand brand_in_subdomain brand_in_path suspecious_tld statistical_report nb_hyperlinks ratio_intHyperlinks ratio_extHyperlinks ratio_nullHyperlinks nb_extCSS ratio_intRedirection ratio_extRedirection ratio_intErrors ratio_extErrors login_form external_favicon links_in_tags submit_email ratio_intMedia ratio_extMedia sfh iframe popup_window safe_anchor onmouseover right_clic empty_title domain_in_title domain_with_copyright whois_registered_domain domain_registration_length domain_age web_traffic dns_record google_index page_rank status
6830 6830 https://www.bankwest.com.au/personal/bank/bank... 55 19 0 3 1 0 0 0 0 0 0 0 5 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 3 0 0 1 0 1 0 6 4 3 3 4 8 8 8 5.833333333 5.5 6 0 0 0 0 0 0 202 0.900990099 0.099009901 0 0 0 0.15 0 0 0 0 100 0 100 0 0 0 0 66.10169492 0 0 0 0 one 1 0 -1 16455 0 1 5 legitimate
13662 5661 http://www.huanqiucaijing.cn/wp-admin/en/B/ 43 21 1 2 1 0 0 0 0 0 0 0 6 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 2 1 1 0 0 1 0 6 3 1 3 1 14 14 5 4.5 8.5 2.5 2 0 0 0 0 0 20 0.95 0.05 0 0 0 0 0 1 0 0 0 0 100 0 0 0 0 93.75 0 0 0 1 1 0 1091 -1 0 0 1 0 phishing
9448 1447 https://www.newegg.com/Product/ProductList.asp... 92 14 1 3 0 0 1 1 2 0 0 2 4 0 1 0 0 0 1 0 0 0 0.043478261 0 0 0 0 0 0 3 0 0 0 0 1 0 11 5 3 3 3 11 6 11 6.363636364 4.5 6.777777778 0 0 0 0 0 0 117 0.068376068 0.931623932 0 3 0 0.018348624 0 0.009174312 0 1 0 0 0 100 0 0 0 72.5 0 0 0 0 0 0 1653 7479 629 0 0 7 legitimate
11821 3820 https://www.tumblr.com/safe-mode?url=https%3A%... 96 14 1 4 1 0 1 0 1 0 0 4 3 0 1 0 0 0 1 1 1 0 0.041666667 0 0 0 1 0 0 3 0 0 0 0 0 0 12 4 2 3 2 33 6 33 6.083333333 4.5 6.4 0 1 0 0 0 0 5 0.4 0.6 0 0 0 0.666666667 0 0 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 1050 5158 114 0 1 8 legitimate
11188 3187 https://www.autolikesfree.net/ 30 21 1 2 0 0 0 0 0 0 0 0 3 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 4 3 3 0 13 13 0 8 8 0 0 0 0 0 0 0 48 0.583333333 0.416666667 0 4 0 0 0 0 0 1 0 0 0 0 0 0 0 75 0 0 0 1 1 0 163 933 249971 0 0 3 phishing
In [ ]:
df['avg_word_path'].max()
Out[ ]:
'96.22222222'

Preprocessing¶

In [ ]:
df['page_rank'].unique()
Out[ ]:
array(['5', '0', '4', '2', '10', '6', '7', '3', '8', '1', '9'],
      dtype=object)

Checking null values¶

In [ ]:
df.isna().sum().sum() 
Out[ ]:
0
In [ ]:
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
In [ ]:
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
In [ ]:
df.drop(columns=['id', 'url'], inplace=True)

💡 Feature selection¶

In [ ]:
columns = ['url_length', 'hostname_length', 'ip', 'total_of.', 'total_of-', 'ratio_digits_url', 'ratio_digits_host', 'avg_words_raw', 'avg_word_host', 'avg_word_path', 'domain_registration_length', 'domain_age', 'web_traffic']

for column in columns:
    data = df[column].dropna()

    data_float = [float(val) for val in data]

    # Define the number of bins
    num_bins = 20

    # Calculate the bin edges (20 evenly spaced bins)
    bin_edges = np.linspace(min(data_float), max(data_float), num_bins + 1)

    # Create the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(data_float, bins=bin_edges, alpha=0.7, edgecolor='black')

    # Add labels and title
    plt.xlabel('Value Range')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {column} with {num_bins} Bins')

    # Add frequency counts on top of each bar
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    for count, x in zip(np.histogram(data_float, bins=bin_edges)[0], bin_centers):
        plt.text(x, count, str(count), ha='center', va='bottom')

    # Show plot
    plt.show()
In [ ]:
corr = df.corr()
corr.sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
Out[ ]:
<Axes: title={'center': 'Correlation with the target variable'}>
In [ ]:
new_df = df.loc[:, ['url_length', 'hostname_length', 'ip', 'total_of.', 'total_of-', 'ratio_digits_url', 'ratio_digits_host', 'avg_words_raw', 'avg_word_host', 'avg_word_path', 'domain_registration_length', 'domain_age', 'web_traffic']]
new_df.sample(1)
Out[ ]:
url_length hostname_length ip total_of. total_of- ratio_digits_url ratio_digits_host avg_words_raw avg_word_host avg_word_path domain_registration_length domain_age web_traffic
9802 49 25 1 2 0 0 0 11.33333333 21 6.5 217 1975 0
In [ ]:
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df.hist(bins = 50, figsize = (30,30))
plt.show()
In [ ]:
plt.figure(figsize=(15,13))
sns.heatmap(df.corr())
plt.show()

Correlation matrix¶